Text File | 1994-03-03 | 53KB | 1,142 lines

_______________________________________________________

 MMMMMMMMMMMM   VV     VV   SSSSSSSS   PPPPPPPP
 MM   MM   MM   VV     VV   SS         PP    PP
 MM   MM   MM    VV   VV    SSSSSSSS   PPPPPPPP
 MM   MM   MM     VV VV           SS   PP
 MM   MM   MM  *   VVV   *  SSSSSSSS * PP       *
_______________________________________________________

A MultiVariate Statistics Package
for the IBM PC and Compatibles

(C) Copyright Warren L. Kovach, 1986
Department of Biology
Indiana University
Bloomington, IN 47405

Ver. 1.3, Feb., 1986

This program is being distributed as user-supported software. If you find this program to be of value, a voluntary contribution ($25 suggested) would be appreciated.

MVSP Ver. 1.3 -- User's Manual

CONTENTS
--------

Introduction
Acknowledgements
Disclaimer
General Use of Program
  Main Menu Options
    A-E: Statistical Procedures
    F: Change Drive or Sub-directory
    G: Change Program Defaults
    H: HELP!
    Q: Quit MVSP
  Data Files
    Data File Header
    Data Labels
    Data File Titles
    Data Matrix
  Running Statistical Procedures
    Principal Components Analysis
    Reciprocal Averaging
    Dissimilarity and Similarities
    Cluster Analysis
    Diversity Indices
Future Plans
8087 Support
The User Supported Concept
Appendix: Test Data Files
References

INTRODUCTION

MVSP is a package of common multivariate statistical procedures widely used in many areas of biology and geology, as well as other fields. These procedures include principal components analysis, reciprocal averaging, distance or dissimilarity measures, average-linkage cluster analysis, and diversity indices. These procedures are geared towards quick, simple analyses of small to medium sized data sets. Any heavy number crunching would be best suited for mainframe computers or some of the more sophisticated microcomputer statistical packages which are available. However, the price and simplicity of use of MVSP are hard to beat!

I've tried to make this program as easy to use as possible. One possible drawback to ease of use is that some users may be very tempted to take a "black box" approach to using these statistics, feeding in numbers and coming up with "The Answer". I must strongly warn the users of this program that statistics can be DANGEROUS! All these procedures make assumptions about the data and have restrictions on what they can and cannot do. If these assumptions and restrictions are violated, the results could be meaningless. I urge you to become familiar with the methods and their assumptions before you use this program. This manual contains a list of references which I have found very useful in understanding these techniques. In particular, Sneath & Sokal (1973), Gauch (1982), and Pielou (1984) are very well written and give very clear discussions of these techniques.

ACKNOWLEDGEMENTS

This program is written in Turbo Pascal, and compiled using the version 3.0 compiler.
The procedures for producing the pop-up menus and the disk directory listings are modified from Philip R. Burns' public domain procedures PIBMENUS and PIBDIR, both of which are incorporated into his PIBTERM program. These procedures are widely available on many electronic bulletin board services across the country. Check with your local users groups for more information, if you haven't already been bitten by the BBS bug.

The assembly language procedure for direct memory video output is from Steve Hall's contribution to "PC-Magazine's" Power User column (Oct. 1, 1985).

The eigenanalysis algorithm used in the principal components analysis and reciprocal averaging procedures is translated and modified from Orloci's (1978) BASIC programs.

The scattergram procedure in the PCA and RA procedures is translated and modified from Cooke, Craven, and Clarke's book "Basic Statistical Computing", a very nice book with BASIC programs for doing numerous types of statistical analyses.

The sort procedure used in the Spearman coefficient procedure is taken from Jim Savold's ZIPSORT procedure (ver. 1.1).

DISCLAIMER

The accuracy of this program has of course been extensively tested against the results of other programs, but the results are not guaranteed. You may wish to initially also run comparisons with the results of other programs, using your own data set, to ensure that it is working properly with your type of data. We all know about those demons which manage to get into computer programs, causing foul-ups when we least suspect it!

Note when running comparisons that there are often many methods of computing the same thing, and results may vary, especially in the more complex principal components and reciprocal averaging procedures.
In principal components analysis, for instance, there are numerous ways of transforming the data before eigenanalysis, and the component loadings can be scaled either to unity (as they are here) or to the variance of that principal component. These differences may have great effects on the results, and should be kept in mind.

If you do run into any problems with this program, whether they be in the results or abnormalities in the running of the program, please contact me at the address given on the title page, or through PC-LINK CENTRAL in Bloomington (812-824-7990), and give details of the problem and, if possible, the data set which you were running when the bug cropped up.

Please note that no warranty is given for this program. The author (Warren L. Kovach) shall not be legally liable for any damages or lost profits arising from use or misuse of this program.

GENERAL USE OF PROGRAM

This program is a simple to use, menu-driven program which presents you with the possible options at each step. The program is initiated by typing the name of the program, MVSP, at the DOS prompt. Note that there are two files which are necessary for this program, MVSP.COM and MVSP.000, and these must both be on the default drive when the program is started. If you have changed any of the program defaults, the configuration file named MVSP.CNF (which is created when you save your changes) must also be on the default drive.

When the program is loaded, you will see an introductory screen giving the name and address of the author, then you will be presented with a menu of available procedures. The first option on the menu will be highlighted by a rectangular cursor. This cursor can be moved up and down the list of options by using the up and down arrow keys on the numeric keypad of the keyboard. A choice of option is made by hitting the carriage return when the correct option is highlighted, or alternatively by typing the letter preceding the desired option.

MAIN MENU OPTIONS
=================

OPTIONS A-E: The first five options are for the basic statistical procedures: PRINCIPAL COMPONENTS ANALYSIS, RECIPROCAL AVERAGING, SIMILARITIES AND DISSIMILARITIES, CLUSTER ANALYSIS, and DIVERSITY INDICES. These procedures are described later in this document.

OPTION F: This option, CHANGE DRIVE OR SUB-DIRECTORY, allows you to temporarily change the drive and sub-directory on which the input and output data files will be found by default. If you enter a path name without a drive specification, the default drive is assumed. If you enter just a drive specification (e.g. "A" or "A:" or "A:\") the default path will be the root directory of that drive. A "?" lists the sub-directories in the currently logged directory. A carriage return with no other input exits this option with no changes.

OPTION G: The CHANGE PROGRAM DEFAULTS option allows you to change the initial default colors, path name, and data file extensions. These default specifications can be saved to the file MVSP.CNF, which will be reloaded each time the program is run, reinstating these defaults. When you choose this option you will be presented with a menu asking which type of default should be changed.

DEFAULT COLORS allows you to change the color of the regular text and background, the menu text and background, and the menu frame. Choosing one of these will cause a menu of available colors to appear. You can experiment with color combinations easily, quitting the color menu when you are satisfied. Note that option "F" on the menu resets black and white colors, which are the defaults if the MVSP.CNF configuration file is not found. This option can be useful in case you get yourself into a color combination that is so unreadable that you can't see the options available!

DEFAULT DATA FILE PATH changes the default path used for data files, just like option F above.
However, this option allows you to save this specification for future use, while option F is for temporary changes. If you are using a two floppy disk system, it is often most useful to have the program files in drive A:, and to have the default data file path set to B:, so that data files are on another disk. If you have a hard disk, you could have the program files in a subdirectory named C:\MVSP (which would be the default directory when you invoke the program) and the data either on a floppy in drive A: or B:, or in a hard disk directory named C:\MVSP\DATA. You would then specify the default data file path through this option. You can even set up separate directories for different types of data, which is where the temporary path change option (option F) would come in handy. You can always override the default path option by either changing it through options F or G, or by specifying the drive and path when you are asked for the name of the data file when running one of the statistical procedures.

DEFAULT DATA FILE EXTENSIONS allows you to change the default extensions for your input and output files. I personally prefer *.DAT for input files and *.OUT for output files (these are the internal defaults used if MVSP.CNF is not found), but you can easily change this and save your changes. The cluster analysis program can have different defaults, which facilitates the input of similarity or dissimilarity coefficients from this program to the cluster procedure. The coefficients program can output a symmetrical matrix to a file in the form required by the cluster procedure. The filename extension of this file will default to the extension which you specify for cluster analysis input (*.DIS is the internal default). Thus, to perform a cluster analysis of the file DATA.DAT, you need only to enter the name DATA in both the similarity procedure and the cluster procedure.
The similarity matrix will be calculated for DATA.DAT, placed in DATA.DIS, and read from DATA.DIS by the cluster procedure. The output file for the cluster program can also have its own default extension (*.CLS is the internal default). Entering a blank carriage return for the output file extensions will direct output to the default printer (Lst) instead of a file. Entering "NUL" will nullify any hard copy output, and you will only see the results printed to the screen.

MINIMUM EIGENVALUE allows you to control the number of components which are printed out in the PCA and RA procedures by changing the value for the minimum eigenvalue. More on this in the section on PCA.

REREAD CONFIGURATION FILE will reread the MVSP.CNF configuration file which contains the user default settings. This will reinstate the default settings which are normally active when the program is initiated. This can be handy if you have made a lot of changes to defaults during a session (without saving them!) and you wish to return to your old defaults.

SAVE DEFAULTS TO FILE MVSP.CNF will save any changes in the defaults to a configuration file, which will be reloaded every time the program is run. If this file is not found in the same directory as the other MVSP program files, the internal defaults will be set. If any changes are made to the defaults, and you attempt to exit the configuration menu without saving them, you will be reminded that these new defaults have not been saved and given the option to return and save these options, or continue back to the main menu.

HELP! will provide abbreviated descriptions of the options of the configuration menu.

QUIT CONFIGURE will return you to the main menu.

OPTION H: HELP! will provide descriptions of the main menu options as well as information about the expected format of the data files and the author's name and address.

OPTION Q: QUIT MVSP will exit the MVSP program and return to the DOS prompt.
DATA FILES
==========

The input data files should be ASCII text files which can be created with the DOS line editor EDLIN, or many other word processors, such as PC-WRITE or XYWRITE. Some word processors, such as WORDSTAR, modify some characters to special formatting characters ("high bits"). These modified characters will not be able to be read by MVSP. You can check whether your word processor is one of these by listing a word processed file with the DOS TYPE command and looking for strange characters. If your word processor uses these extra characters, make sure you create your data file in a non-document mode which creates normal ASCII files.

You may also maintain your data with spreadsheet or database programs, such as LOTUS 123. Most of these have an option for printing data to ASCII files, which can then be modified to the appropriate format for MVSP (mainly by adding the file header information, discussed below). This can greatly expedite data management and manipulation, making it easier to select species or sites to be analyzed.

DATA FILE HEADER:

The first line of the data file should be a header line, which will give the program some information about the data, such as the number of rows and columns. It should look something like this:

    * 10 15

This header line should begin with an asterisk ("*") in the first column of the first line of the file. This asterisk tells the program that a header is present. If the asterisk is not found, the program assumes that the header information is not present, and it will prompt the user for the information. The two numbers are the number of rows and columns in the data matrix. The above example has 10 rows and 15 columns. MAKE SURE that if this header information is present, there is an asterisk before it; if not, the header information will be read as data!

You may also include data labels in the data file.
These labels will be printed on your output to help make sense of the masses of numbers which will be spewed out. If labels are included, this must be specified in the file header. For example:

    *L 10 15

specifies a data file which includes data labels and which has 10 rows and 15 columns (NOT including the labels themselves). The "L" must come immediately after the "*", with no intervening spaces, or it will be read as the number of rows, and an error will occur. The numbers of rows and columns must be separated by at least one space from each other.

DATA LABELS:

The column and row labels themselves can be up to 8 characters long and may consist of any printable character, except spaces. The following are all valid labels:

    ROW1   COLUMN_2   1st-Loc.   #3-Site

This label is NOT valid:

    SITE 1

It will be read as two labels, "SITE" and "1".

The column labels should be in the second row of the data file, after the header line, and the labels should be separated by at least one space. The labels may be continued onto subsequent lines; the program will continue reading column labels until it has read as many as the number of columns you have specified in the header line. Row labels occur on the same line as the data row to which they apply, and should precede the first datum in that row, with a space separating the label and datum.

DATA FILE TITLES:

A title may also be added to your data file on the header line, so that you know what this data represents. Here's an example:

    *L 10 15 Test data file for MVSP

This title, "Test data file for MVSP", will be listed to the screen and placed on the output when that file is selected. It must be separated from the other elements of the header by at least one space, and it cannot be more than 70 characters long.
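
The header rules described so far can be summarized in a short sketch. This is not MVSP's own code (the program is written in Turbo Pascal); it is a Python illustration, and the function name `parse_header`, the returned dictionary, and the choice to raise an error for an over-long title are all assumptions made for the example. Only the upper-case forms "*" and "*L" shown in this manual are accepted.

```python
def parse_header(line):
    """Parse a header such as '*L 10 15 Test data file for MVSP'.

    Returns None when the line does not begin with '*', which is the
    case where MVSP would prompt for the dimensions instead.
    """
    if not line.startswith("*"):
        return None
    # Split into flag, rows, columns, and the remainder (the optional title).
    parts = line.split(None, 3)
    if len(parts) < 3:
        raise ValueError("header needs a row count and a column count")
    flag = parts[0]
    if flag not in ("*", "*L"):
        # Assumption: only the two documented forms are recognized.
        raise ValueError("unrecognized header flag: %r" % flag)
    rows, cols = int(parts[1]), int(parts[2])
    title = parts[3].strip() if len(parts) > 3 else ""
    if len(title) > 70:
        # Assumption: the manual states the limit but not what happens
        # when it is exceeded, so this sketch simply rejects the header.
        raise ValueError("title longer than 70 characters")
    return {"labels": flag == "*L", "rows": rows, "cols": cols, "title": title}


print(parse_header("*L 10 15 Test data file for MVSP"))
print(parse_header("* 10 15"))
print(parse_header("23 2 4 53"))
```

A line that does not start with an asterisk returns None here, which mirrors the warning above: such a line would be treated as data.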
The dissimilarities procedure will also place this title in the header of the matrix output file, along with the specification of which coefficient was used, so that the title is carried over to the clustering program.

DATA MATRIX:

The data matrix itself should consist of the data points separated by at least one space. The data for one row can be continued on the next line. If the number of rows or columns you specify is wrong, the data matrix will be read wrong, often without warning. If you have a 10x10 matrix and specify 9 columns by mistake, the last datum on the first row will be read as the first datum of the second row, and so on. This, needless to say, can play havoc with your results! BE CAREFUL! All procedures can print out the raw data so that you can check to make sure it was read correctly.

Here is an example data file:

    *L 5 10 Test data set for MVSP
         COL1 COL2 COL3 COL4 COL5 COL6 COL7 COL8 COL9 COL10
    ROW1   23    2    4   53    6   45    2    3   67     5
    ROW2   10    2    4   34    1    4    3   10   20     3
    ROW3    2   34    0    1   35   12    1   90   10     9
    ROW4   98   12   10    4   10    9   10    5   20    31
    ROW5    1    7    9   11   75    7    5   21    0    10

The input data files for the cluster analysis program use a slightly different header format. Here is an example:

    *L 15 DIS Test data set for MVSP

Since the clustering program uses a symmetrical matrix as input, it only needs one number for the size of the data matrix. In this case the size of the matrix is 15x15. The third element of the header is a three letter phrase specifying whether the matrix is a similarity (SIM) or dissimilarity (DIS) matrix. This code MUST be separated from the number of objects by only one space, or it will not be read correctly.

The dissimilarity and similarity procedure of this program automatically sets up its output files in this manner for input into the clustering procedure. Here is an example of a clustering input file, generated from an analysis of the above matrix, using the Spearman Rank Order Correlation Coefficient:

    *L 10 SIM Test data set for MVSP - SPEARMAN
    COL1 COL2 COL3 COL4 COL5 COL6 COL7 COL8 COL9 COL10
     1.00
    -0.15  1.00
     0.36 -0.05  1.00
     0.20 -0.97  0.05  1.00
    -0.60  0.67  0.15 -0.60  1.00
     0.30  0.21 -0.31 -0.00  0.10  1.00
     0.30 -0.05  0.97  0.00  0.10 -0.50  1.00
    -0.80  0.62 -0.41 -0.70  0.60 -0.30 -0.30  1.00
     0.82 -0.55 -0.03  0.62 -0.82  0.41 -0.10 -0.87  1.00
     0.10  0.67  0.67 -0.60  0.70  0.10  0.60  0.10 -0.41  1.00

Note that this file is a lower half matrix, with diagonals (the 1.00's) included. Other forms of matrices may also be specified for input to the clustering program, as discussed below, but this is the default output form of the similarities and dissimilarities procedure.

RUNNING STATISTICAL PROCEDURES
==============================

When one of the statistical procedure options (A-E) is chosen, you will first be asked for the name of the input data file. You may obtain a directory of the default data disk and path by typing a "?". You may then specify a certain file mask (such as *.DAT for all files with a .DAT extension) or simply hit the carriage return for all files. You may then enter the name of the data file. The program will automatically add your specified default extension if no extension is specified. So, if your datafile is named "STUDY1.DAT" and your default extension is *.DAT, you need only type "STUDY1". If you specify another extension, or have a filename with no extension, the program will recognize those as long as the full name is specified. A blank carriage return here will return you to the main menu.

If you have elected (through the configuration menu) to have output sent to the printer, then you will be prompted to make sure that your printer is ready, and you will then go into the statistical procedure you have selected. If you have instead specified a default output file extension, you will next be prompted for the name of the output file.
If you enter a blank carriage return, this output file will default to the input file name plus the default output file extension you have specified. The output file for an analysis of STUDY1.DAT will default to STUDY1.OUT if your default output extension is *.OUT.

If you have chosen to run the dissimilarities procedure, you will also be asked if you wish to have the results input into the clustering procedure. If so, another filename must be specified to contain just the distance matrix, with none of the ancillary information. This filename defaults to the default extension for the cluster analysis input files.

After the book-keeping business is taken care of, you will then enter the actual procedure which you have chosen. These will be discussed separately.

PRINCIPAL COMPONENTS ANALYSIS:

This procedure performs a simple R-mode principal components analysis. The component loadings are scaled to unity, so that the sum of squares of an eigenvector equals 1, and the component scores are scaled so that the sum of squares equals the eigenvalue. Q-mode PCA will have the opposite scaling. Note that many packages, such as SPSS and SYSTAT, perform Q-mode PCA, and thus their eigenvectors will be scaled to the eigenvalue, rather than unity. Note also that the data matrices for MVSP must be transposed for use with packages such as SPSS or SYSTAT to obtain the same eigenvalues.

For details on the computation and assumptions of the PCA technique, see Orloci (1978), Gauch (1982), and Pielou (1984). Orloci gives a detailed mathematical discussion of the particular algorithm used here, while Gauch and Pielou give very clear and understandable discussions of the basis of the technique and its use and assumptions. The size of data matrix which can be analyzed is limited to 55x55 (45x45 for the 8087 version).
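
The scaling convention described above (loadings with a sum of squares of 1, scores with a sum of squares equal to the eigenvalue) can be checked numerically on a tiny case. The following Python sketch is not the Orloci-derived algorithm MVSP uses. It handles only two descriptors, for which the eigenvalues of the 2x2 covariance matrix have a closed form, and it assumes a centered covariance matrix with divisor n-1 and scores divided by SQRT(n-1) to obtain the stated scaling. The function name `pca_two_descriptors` is hypothetical; the data are the first two rows of the example data file.

```python
import math

# First two rows of the manual's example data file (2 descriptors x 10 objects).
ROWS = [
    [23, 2, 4, 53, 6, 45, 2, 3, 67, 5],
    [10, 2, 4, 34, 1, 4, 3, 10, 20, 3],
]


def pca_two_descriptors(rows):
    """R-mode PCA of exactly two descriptors (rows) over n objects (columns)."""
    n = len(rows[0])
    centered = [[x - sum(r) / n for x in r] for r in rows]

    def cov(u, v):
        return sum(p * q for p, q in zip(u, v)) / (n - 1)

    # Covariance matrix [[a, b], [b, c]] between the two descriptors.
    a = cov(centered[0], centered[0])
    b = cov(centered[0], centered[1])
    c = cov(centered[1], centered[1])
    mid = (a + c) / 2
    half = math.hypot((a - c) / 2, b)
    components = []
    for idx, lam in enumerate((mid + half, mid - half)):
        if b != 0:
            vec = (b, lam - a)          # solves (a - lam)*v1 + b*v2 = 0
        else:
            # Uncorrelated descriptors: the components are the axes themselves.
            first_axis = (a >= c) == (idx == 0)
            vec = (1.0, 0.0) if first_axis else (0.0, 1.0)
        norm = math.hypot(vec[0], vec[1])
        loadings = (vec[0] / norm, vec[1] / norm)   # scaled to unity
        # Assumption: dividing by sqrt(n - 1) gives scores whose sum of
        # squares equals the eigenvalue of the covariance matrix.
        scores = [(loadings[0] * x + loadings[1] * y) / math.sqrt(n - 1)
                  for x, y in zip(centered[0], centered[1])]
        components.append((lam, loadings, scores))
    return components


components = pca_two_descriptors(ROWS)
for lam, loadings, scores in components:
    print("eigenvalue %.4f  loadings SS %.6f  scores SS %.4f"
          % (lam, sum(v * v for v in loadings), sum(s * s for s in scores)))
```

Each printed line should show a loadings sum of squares of 1 and a scores sum of squares matching the eigenvalue, which is the relationship to look for when comparing MVSP output with another package.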
In the R-mode analysis, similarity coefficients are calculated for the descriptors, which are the rows of the matrix (species in an ecological study, characters in a numerical taxonomic study) and component scores are calculated for the objects, which are the columns of the matrix (samples or operational taxonomic units (OTU's)).

You will first be asked if you wish to have the raw data and the similarity matrix printed out. In analyses of large data sets, the printing of the data and similarity matrix can add a little bit of time to the analysis, as well as a hefty pile of paper. I find it useful to see this output, however, particularly to check to see if the data was read correctly.

Next, you will be asked if you want the data to be log transformed. PCA assumes a normal distribution of the data, but this assumption is often not met. Log transforming the data can reduce the skewness of the data, resulting in a more interpretable analysis (Spicer & Hill, 1979). In my research with fossil plant data, I've found this to be invaluable, as I always have some samples with extremely high abundances of certain taxa, and these taxa tend to dominate the analysis due to their large numbers. Log transforming the data evens this out. You are given the option of what base of logarithm to use.

When the procedure is run, you will have the option of using either a covariance or correlation matrix, and of using either a centered or uncentered data matrix. Generally a centered covariance matrix is used, but if different units of measurement are used in the data matrix, these will need to be standardized, and thus a correlation matrix should be used. Standardization may also be desired to reduce the effects of dominant species, so that rarer species play a greater role in the resulting configuration. An uncentered data matrix is called for when there is appreciable between-axes heterogeneity.
This means that different clusters of points are associated with different axes, and have little projection on other axes. This often occurs when different groups of samples have completely different sets of common species, with little overlap. See Pielou (1984) for more on this.

Status messages will be listed to the screen during the analysis to let you know how things are proceeding. The final results will also be listed out while they are being saved to the output file or sent to the printer. The eigenvalues and their percentage of the total variation will be printed along with the component coefficients (or eigenvectors), then the component scores for each principal component will be printed.

You may choose the minimum eigenvalue for which principal components are printed out. The internal default is to print components only if the eigenvalue is greater than the average eigenvalue. This is often considered a good rule of thumb for determining whether a component is interpretable (Legendre & Legendre, 1983). You may change this default through the program defaults option (G) on the main menu. A value of 0 will cause all components to be printed out, and any other value, such as 1, may also be entered as a minimum eigenvalue. This minimum value may be saved in the MVSP.CNF configuration file along with the colors and default datafile paths and extensions.

You may also have the component loadings and component scores plotted on a scatter diagram. You will be asked how many axes you wish to have plotted. If you choose three, for instance, the first three axes will be plotted against each other in every combination of two dimensional plots (3 plots in this case, 6 for four axes, etc.). Entering a zero will bypass the plotting procedure.

After the component plots, the raw data will be printed out sorted by the first component scores and factors. This can be useful for allowing you to see patterns and trends in the raw data alone.
If the first component accounts for a large proportion of the variance, and if there is an interpretable gradient along the first axis, then this pattern can be striking.

RECIPROCAL AVERAGING:

The reciprocal averaging procedure performs an eigenanalysis form of reciprocal averaging. Again, see Orloci (1978), Gauch (1982) and Pielou (1984) for details on this procedure. The setup and usage of this procedure is similar to the PCA procedure, with some differences. This procedure uses more computer memory, with the result that the largest matrix which can be analyzed is 45x45 (40x40 for the 8087 version). There are also a few more options available. The analysis can be run with either a weighting of the rare species or the common species. See Orloci (pp. 152-168) for details of these methods of weighting. Also, the scores can be adjusted to percentages, to approximate the results of the original RA algorithm as put forth by Hill (1973). The data file should have species as the rows and samples as the columns, as in the PCA procedure.

DISSIMILARITY AND SIMILARITIES:

This program calculates a variety of dissimilarity and similarity measures. There are presently six measures available.
These procedures, and their formulas, are:

Euclidean distance:

    EDjk = SQRT [ SUMi SQR (Xij - Xik) ]

Cosine theta (or normalized Euclidean) distance:

    CDjk = SQRT [ SUMi SQR (Xij / Yj - Xik / Yk) ]
    where Yx = SQRT [ SUMi SQR (Xix) ], for column x = j or k

Manhattan metric distance:

    MMDjk = SUMi [ ABS (Xij - Xik) ]

Canberra metric distance:

    CMDjk = SUMi [ ABS (Xij - Xik) / (Xij + Xik) ]

Spearman rank order correlation coefficient:

    SCCij = 1 - [ ( 6 * SUMk SQR (Rik - Rjk) ) / (CUBE (N) - N) ]
    where R = rank of variable

Pearson product moment correlation coefficient:

    PCCij = [ SUMk (Xik - MEAN (Xi)) * (Xjk - MEAN (Xj)) ] /
            [ SQRT ( SUMk SQR (Xik - MEAN (Xi)) ) *
              SQRT ( SUMk SQR (Xjk - MEAN (Xj)) ) ]

(X = data value; ABS = absolute value; SQR = square; SQRT = square root; MEAN = mean; CUBE = cubed; SUM = summation)

See Sneath & Sokal (1973), Pielou (1984), and Prentice (1980) for discussions and derivations of these measures.

The maximum size of data matrix allowed is 95x95 (85x85 for the 8087 version). The distances are calculated between the columns of the data matrix. An option to transpose the data matrix before the analysis is included, to allow analysis of the rows without requiring reentry of the data.

This procedure is set up to allow easy input of the distance measures into the clustering analysis procedure. If you choose to input the distance matrix into the clustering program, a copy of the distance matrix along with the appropriate header information will be put into a separate file from the full output. This matrix file can then be used as input to the clustering program.

CLUSTER ANALYSIS:

This procedure performs average linkage cluster analysis on an input matrix of some sort of distance or similarity measure. Four forms of average linkage clustering are presently available: unweighted pair group, unweighted centroid, weighted pair group, and weighted centroid (or median).
For clear and concise explanations of the theory and practice behind cluster analysis, see Sneath and Sokal (1973) and Pielou (1984). The largest data matrix this program can handle is 95x95 (85x85 for the 8087 version). A number of different input formats are available, including various forms of half matrices and full matrices (a lower half matrix with a diagonal, the output form of the dissimilarity procedure, is the default). You must also specify whether the input measure is a similarity or dissimilarity measure (if it isn't specified in your data file header).

The output of the procedure consists of a report of the status of the clustering procedure as each new object is added to the cluster. The average similarity or dissimilarity of the two groups which have just been joined is printed out, along with a listing of the two groups and the number of objects in the newly fused group. If a single object is added to another cluster, the label for that object (or a numerical label corresponding to its position in the data matrix) is printed out. If a whole group is added, the node at which that group was last added to is printed out. For instance, a report such as:

    NODE   GROUP 1   GROUP 2
      1    COL1      COL2
      2    COL4      COL5
      3    NODE 1    COL3
      4    NODE 3    NODE 2

would correspond to a dendrogram of the form:

    COL1  COL2  COL3  COL4  COL5
     |     |     |     |     |
     -------     |     -------
        |        |        |
        ----------        |
             |            |
             ---------------
                    |

The actual lengths of the branches of this dendrogram would depend on the average similarity of each group as they are fused. The dendrogram can be reconstructed by hand, or the dendrogram can be plotted using computer graphics programs. Joseph Felsenstein's cladistic package PHYLIP contains a program written by Christopher Meacham for drawing cladograms and dendrograms. See Felsenstein (1985) for details on the availability of this free package.
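
To make the path from distance measure to clustering report concrete, here is a small Python sketch. It is an illustration only, not MVSP's Turbo Pascal code. It computes the Euclidean distance between the columns of the example data file given earlier and then applies unweighted pair group averaging, printing one line per fusion in a layout resembling the report above. The function names `euclidean` and `upgma`, the tuple layout of the report, and the brute-force averaging over all member pairs are choices made for this example.

```python
import math
from itertools import combinations

# Example data file from this manual: 5 rows x 10 columns.
LABELS = ["COL%d" % i for i in range(1, 11)]
DATA = [
    [23, 2, 4, 53, 6, 45, 2, 3, 67, 5],
    [10, 2, 4, 34, 1, 4, 3, 10, 20, 3],
    [2, 34, 0, 1, 35, 12, 1, 90, 10, 9],
    [98, 12, 10, 4, 10, 9, 10, 5, 20, 31],
    [1, 7, 9, 11, 75, 7, 5, 21, 0, 10],
]


def euclidean(data, j, k):
    """EDjk = SQRT [ SUMi SQR (Xij - Xik) ] between columns j and k."""
    return math.sqrt(sum((row[j] - row[k]) ** 2 for row in data))


def upgma(data, labels):
    """Unweighted pair group clustering of the columns of `data`."""
    n = len(labels)
    dist = {(j, k): euclidean(data, j, k)
            for j, k in combinations(range(n), 2)}
    # Active groups: group name -> list of member column indices.
    groups = {labels[j]: [j] for j in range(n)}
    report = []

    def average(a, b):
        # Mean of all pairwise distances between members of the two groups.
        pairs = [(min(j, k), max(j, k)) for j in groups[a] for k in groups[b]]
        return sum(dist[p] for p in pairs) / len(pairs)

    while len(groups) > 1:
        a, b = min(combinations(groups, 2), key=lambda ab: average(*ab))
        level = average(a, b)
        node = len(report) + 1
        members = groups.pop(a) + groups.pop(b)
        groups["NODE %d" % node] = members
        report.append((node, a, b, level, len(members)))
    return report


report = upgma(DATA, LABELS)
print("NODE  GROUP 1   GROUP 2   DISSIMILARITY  OBJECTS")
for node, a, b, level, size in report:
    print("%4d  %-8s  %-8s  %13.3f  %7d" % (node, a, b, level, size))
```

Recomputing every group average from the original distances is slow for large matrices, but it keeps the definition of the unweighted pair group method visible. The weighted and centroid variants that MVSP also offers update the distances differently and are not shown.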
DIVERSITY INDICES:

This procedure computes three of the diversity indices most commonly used in ecology: Simpson's, Shannon's, and Brillouin's. See Pielou (1969) for a discussion of the use and derivation of these indices.

The input data file should be set up with species as rows and samples as columns. The diversity, then, is calculated for each column. The largest data matrix which can be processed is 95x95 (85x85 for the 8087 version). Be forewarned that the Brillouin index calculates factorials of the species abundances, and if any of your abundances are high, this could take a VERY LONG TIME! Data matrices with numerous species abundances on the order of hundreds or thousands could make for a rather long coffee break!

The output consists not only of the diversity index, but also the number of species and the evenness, which is defined as the diversity divided by the log of the number of species (Pielou, 1969).

FUTURE PLANS

My plans for future versions of this program include adding character graphics procedures for the clustering procedure and adding more coefficients to the dissimilarities and similarities procedure. I am also considering adding Bray & Curtis polar ordination, with some of the modifications which have been suggested by Beals (see his 1984 paper for summaries), as well as detrended correspondence analysis (see Hill & Gauch, 1980). Any comments on favorite statistics out there? Let me know what you would like to see in this program.

I also hope to figure out a way to increase the size of the data matrices that this program can accept. They are now limited by the 64K limit that Turbo Pascal imposes for the size of the data and stack segments.
My attempts to use memory outside of that 64K space for the data matrices have met with some very strange results (including one time when my screen began flashing a psychedelic pattern of ASCII characters while the computer proceeded to trash my data disk; see what I mean by demons?).

If you have any other comments about the procedures in this program, or about procedures NOT in this program which you feel would be useful to include, please send them to me at the address on the title page of this manual. THANK YOU!

8087 SUPPORT

If you aren't satisfied with the speed of this program, a faster version which uses the 8087 math coprocessor is available. This coprocessor (an optional chip that can be plugged into your computer and costs anywhere from $100 to $200) greatly speeds up floating point arithmetic on real numbers. The speedup can often be as much as tenfold! Turbo Pascal, the compiler used for this package, offers a special version of the compiler that creates programs which take advantage of this coprocessor. Programs compiled with this special compiler will only work on machines which have the 8087 installed. They also have lower limits on the data matrix size, since the 8087 version of Turbo Pascal uses more memory to store each number (and hence has a greater accuracy in its computations).

A version of this program which has been compiled for the 8087 is available to registered users (those who have made a voluntary monetary contribution; see below). If you are working with smaller matrices (maximum matrix sizes are specified in the procedure descriptions above), then this could speed things up a good bit. For example, a PCA of a 45x45 data matrix took one hour with the normal version of the program, but only twenty minutes with the 8087 version.

THE USER SUPPORTED CONCEPT

This software package is being distributed under the user supported concept.
In case you haven't run across this software phenomenon, the following is a brief discussion of its tenets.

User supported software is an experiment in "grass-roots" software distribution and development. Andrew Fluegelman, one of the pioneers of this phenomenon, expressed it this way:

     1) The value and utility of software is best assessed by the user on his or her own system.

     2) The creation of new and useful software should be supported by the computing community.

     3) Copying and sharing of software that you have found useful should be encouraged, rather than restricted.

User supported programs, such as this one, are freely distributed to the computing community through the network of electronic bulletin board services, local computer user groups, word of mouth, and networks of friends with similar interests. The user support comes in two forms:

     1) The user is encouraged to evaluate the program, suggest to the author any changes in the program which would be useful, and recommend the program to others if it is worth recommending.

     2) The user is encouraged to support further programming efforts (including enhancements of this program) through a voluntary monetary contribution to the program author.

User supported means that you don't have to pay outrageous prices for a commercial package without even getting a chance to test drive it first to see if it really meets your needs. User supported means that if YOU, the user, decide that this program is worth supporting, then you support it voluntarily, for a reasonable cost, and without the hassles of copy-protection and the high cost of advertising.

You are encouraged to copy and distribute this program. If you find this program to be useful, a voluntary contribution to the author ($25 suggested) would be appreciated. This program is copyrighted, and no price may be charged for this program by any person other than the author (Warren L. Kovach).
A nominal fee may be charged for distribution costs, such as for the media and postage and handling, as long as this fee does not exceed $5.

All registered users (users who have made the voluntary contribution of $25 or more) will be placed on my mailing list. They will be notified of new versions and new features of this program, and will be entitled to upgrades to newer versions for only the cost of postage and the disk (about $5). They will also be entitled to versions of the program compiled for the 8087 math coprocessor, also for only the postage and media cost.

Thank you for supporting MVSP!

APPENDIX: Test Data Files

The following are listings of some example data files which are distributed with MVSP. These data files are taken from the published literature, and the user may compare the MVSP results with those of the original analyses.

File JOLIMOSI.DAT: These data are taken from Jolicoeur & Mosimann (1960), a pioneering study using PCA in morphometrics. The data are measurements (in millimeters) of the length, width, and height of the carapaces of 24 male painted turtles (Chrysemys picta marginata). The authors interpret the first PC as corresponding to size increase (growth), while the second and third PCs are interpreted as shape variation.

     *L 3 24 Turtle carapice data from Jolicoeur & Mosimann, 1960, males.
     T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12
     T13 T14 T15 T16 T17 T18 T19 T20 T21 T22 T23 T24
     LENGTH   93  94  96 101 102 103 104 106 107 112 113 114
             116 117 117 119 120 120 121 125 127 128 131 135
     WIDTH    74  78  80  84  85  81  83  83  82  89  88  86
              90  90  91  93  89  93  95  93  96  95  95 106
     HEIGHT   37  35  35  39  38  37  39  39  38  40  40  40
              43  41  41  41  40  44  42  45  45  45  46  47

File GAUCH.DAT: These data are taken from Gauch (1982). These are composite samples of upland forest communities from southern Wisconsin, taken from a pioneer (sample 1) to climax (sample 10) gradient. He uses these data to demonstrate many different ordination techniques.
Gauch does not analyze these data with RA, but he does apply detrended correspondence analysis to them, with results similar to those of MVSP's RA program (particularly on the first axis).

     *L 14 10 Wisconsin forest communities data from Gauch, 1982, Table 4.4
                S1  S2  S3  S4  S5  S6  S7  S8  S9 S10
     QUER.MAC    9   8   3   5   6   0   5   0   0   0
     QUER.VEL    8   9   8   7   0   0   0   0   0   0
     CARY.OVA    6   6   2   7   0   2   0   0   0   0
     PRUN.SER    3   5   6   6   6   4   5   0   4   1
     QUER.ALB    5   4   9   9   7   7   4   6   0   2
     JUGL.NIG    2   0   0   0   3   5   6   4   3   0
     QUER.RUB    3   4   0   6   9   8   7   6   4   3
     JUGL.CIN    0   0   5   0   2   0   0   2   0   2
     ULMU.AME    2   2   4   5   6   0   5   0   2   5
     TILI.AME    0   0   0   0   2   7   6   6   7   6
     ULMU.RUB    4   0   2   2   5   7   8   8   8   7
     CARY.COR    0   0   0   0   0   5   6   4   0   3
     OSTR.VIR    0   0   0   0   0   0   7   4   6   5
     ACER.SAC    0   0   0   0   0   5   4   8   8   9

REFERENCES

Beals, E.W., 1984. Bray-Curtis Ordination: An Effective Strategy for Analysis of Multivariate Ecological Data. Adv. in Ecol. Research, 14:1-55.

Cooke, D., Craven, A.H., & Clarke, G.M., 1982. Basic Statistical Computing. Edward Arnold (Publishers) Ltd., London.

Felsenstein, J., 1985. Confidence Limits on Phylogenies: An Approach Using the Bootstrap. Evolution, 39:783-791.

Gauch, H.G. Jr., 1982. Multivariate Analysis in Community Ecology. Cambridge University Press, New York.

Greig-Smith, P., 1983. Quantitative Plant Ecology. University of California Press, Berkeley.

Hill, M.O., 1973. Reciprocal Averaging: An Eigenvector Method of Ordination. Journal of Ecology, 61:237-249.

Hill, M.O., & Gauch, H.G. Jr., 1980. Detrended Correspondence Analysis: An Improved Ordination Technique. Vegetatio, 42:47-58.

Jolicoeur, P., & Mosimann, J.E., 1960. Size and Shape Variation in the Painted Turtle. A Principal Component Analysis. Growth, 24:339-354.

Legendre, L., & Legendre, P., 1983. Numerical Ecology. Elsevier Scientific Publishing Company, New York.

Orloci, L., 1978. Multivariate Analysis in Vegetation Research, 2nd edition. W. Junk, Boston.

Pielou, E.C., 1969. An Introduction to Mathematical Ecology. Wiley-Interscience, New York.

Pielou, E.C., 1984. The Interpretation of Ecological Data. Wiley-Interscience, New York.

Prentice, I.C., 1980. Multidimensional Scaling as a Research Tool in Quaternary Palynology: A Review of Theory and Methods. Review of Palaeobotany & Palynology, 31:71-104.

Sneath, P.H.A., & Sokal, R.R., 1973. Numerical Taxonomy. W.H. Freeman & Co., San Francisco.

Spicer, R.A., & Hill, C.R., 1979. Principal Components and Correspondence Analysis of Quantitative Data from a Jurassic Plant Bed. Review of Palaeobotany & Palynology, 28:273-299.